Abstract

The estimation of a sparse vector in the linear model is a fundamental problem in signal processing, statistics, and compressive sensing. This paper establishes a lower bound on the mean-squared error, which holds regardless of the sensing/design matrix being used and regardless of the estimation procedure. This lower bound very nearly matches the known upper bound one gets by taking a random projection of the sparse vector followed by an $\ell_1$ estimation procedure such as the Dantzig selector. In this sense, compressive sensing techniques cannot essentially be improved.

Keywords: Compressive sensing, sparse estimation, sparse linear regression, minimax 
lower bounds, Fano's inequality, matrix Bernstein inequality 

1 Introduction

The estimation of a sparse vector from noisy observations is a fundamental problem in signal processing and statistics, and lies at the heart of the growing field of compressive sensing [4, 5, 8]. At its most basic level, we are interested in accurately estimating a vector $x \in \mathbb{R}^n$ that has at most $k$ non-zeros from a set of noisy linear measurements

$$y = Ax + z, \tag{1}$$

where $A \in \mathbb{R}^{m \times n}$ and $z \sim \mathcal{N}(0, \sigma^2 I)$. We are often interested in the underdetermined setting where $m$ may be much smaller than $n$. In general, one would not expect to be able to accurately recover $x$ when $m < n$ since there are more unknowns than observations. However, it is by now well known that by exploiting sparsity, it is possible to accurately estimate $x$.

As an example, consider what is known concerning $\ell_1$ minimization techniques, which are among the most powerful and well-understood with respect to their performance in noise. Specifically, if we suppose that the entries of the matrix $A$ are i.i.d. $\mathcal{N}(0, 1/n)$, then one can show that for any $x \in \Sigma_k := \{x : \|x\|_0 \le k\}$, $\ell_1$ minimization techniques such as the Lasso or the Dantzig selector produce a recovery $\hat{x}$ such that

$$\frac{1}{n} \|\hat{x} - x\|_2^2 \le C_0 \frac{k\sigma^2}{m} \log n \tag{2}$$

holds with high probability provided that $m = \Omega(k \log(n/k))$ [6]. We refer to [3] and [9] for further results.

1.1 Criticism



A noteworthy aspect of the bound in (2) is that the recovery error increases linearly as we decrease 
m, and thus we pay a penalty for taking a small number of measurements. Although this effect is 
sometimes cited as a drawback of the compressive sensing framework, it should not be surprising:
we fully expect that if each measurement has a constant SNR, then taking more measurements
should reduce our estimation error. 

However, there is another somewhat more troubling aspect of (2). Specifically, by filling the rows 
of A with i.i.d. random variables, we are ensuring that our "sensing vectors" are almost orthogonal 
to our signal of interest, leading to a tremendous SNR loss. To quantify this loss, suppose that 
we had access to an oracle that knows a priori the locations of the nonzero entries of x and could 
instead construct A with vectors localized to the support of x. For example, if m is an integer 
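multiple of k then we could simply measure each coefficient directly m/k times and then
average these samples. One can check that this procedure would yield an estimate obeying

$$\mathbb{E}\left[\frac{1}{n} \|\hat{x} - x\|_2^2\right] = \frac{k}{n} \cdot \frac{k\sigma^2}{m}. \tag{3}$$
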
Thus, the performance in (2) is worse than what would be possible with an oracle by a factor of 
(n/k)logn. When k is small, this is very large! Of course, we won't have access to an oracle in 
practice, but the substantial difference between (2) and (3) naturally leads one to question whether 
(2) can be improved upon. 
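
The oracle benchmark (3) can be checked numerically. The following is a minimal sketch, assuming that each oracle measurement returns one nonzero coefficient plus independent $\mathcal{N}(0, \sigma^2)$ noise (a unit-norm sensing vector on the support); the names `oracle_mse` and `oracle_prediction` and all parameter values are illustrative and do not come from the paper.

```python
import random


def oracle_mse(n, k, m, sigma, trials=2000, seed=0):
    """Empirical average of (1/n) * ||x_hat - x||_2^2 for the oracle scheme.

    Assumption (illustrative): the support is known, m is a multiple of k,
    and each nonzero coefficient is measured directly m // k times with
    independent N(0, sigma^2) noise, then averaged.
    """
    if m % k != 0:
        raise ValueError("m must be an integer multiple of k")
    rng = random.Random(seed)
    reps = m // k
    total = 0.0
    for _ in range(trials):
        # Arbitrary nonzero values; the error does not depend on them.
        x_support = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        sq_err = 0.0
        for value in x_support:
            samples = [value + rng.gauss(0.0, sigma) for _ in range(reps)]
            estimate = sum(samples) / reps
            sq_err += (estimate - value) ** 2
        # Entries off the support are estimated as exactly zero.
        total += sq_err / n
    return total / trials


def oracle_prediction(n, k, m, sigma):
    """Right-hand side of (3): (k / n) * (k * sigma^2 / m)."""
    return (k / n) * (k * sigma ** 2 / m)
```

Comparing `oracle_mse(100, 4, 40, 1.0)` with `oracle_prediction(100, 4, 40, 1.0)` should show agreement up to Monte Carlo error, since averaging $m/k$ samples leaves variance $k\sigma^2/m$ per coefficient.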



1.2 Can we do better? 

In this paper we will approach this question from the viewpoint of compressive sensing and/or of 
experimental design. Specifically, we assume that we are free to choose both the matrix A and the 
sparse recovery algorithm. Our results will have implications for the case where A is determined 
by factors beyond our control, but our primary interest will be in considering the performance 
obtained by the best possible choice of A. In this setting, our fundamental question is: 

Can we ever hope to do better than (2)? Is there a more intelligent choice for the matrix 
A? Is there a more effective recovery algorithm? 

In this paper we show that the answer is no, and that there exists no choice of A or recovery 
algorithm that can significantly improve upon the guarantee in (2). Specifically, we consider the 
worst-case error over all $x \in \Sigma_k$, i.e.,

$$M^*(A) = \inf_{\hat{x}} \sup_{x \in \Sigma_k} \mathbb{E}\left[\frac{1}{n} \|\hat{x}(y) - x\|_2^2\right]. \tag{4}$$



Our main result consists of the following bound, which establishes a fundamental limit on the 
minimax risk which holds for any matrix A and any possible recovery algorithm. 

Theorem 1. Suppose that we observe $y = Ax + z$ where $x$ is a $k$-sparse vector, $A$ is an $m \times n$ matrix with $m \ge k$, and $z \sim \mathcal{N}(0, \sigma^2 I)$. Then there exists a constant $C_1 > 0$ such that for all $A$,

$$M^*(A) \ge C_1 \frac{k\sigma^2}{\|A\|_F^2} \log(n/k). \tag{5}$$

We also have that for all $A$

$$M^*(A) \ge \frac{k\sigma^2}{\|A\|_F^2}. \tag{6}$$

This theorem says that there is no $A$ and no recovery algorithm that does fundamentally better than the Dantzig selector (2) up to a constant¹; that is, ignoring the difference in the factors $\log(n/k)$ and $\log n$. In this sense, the results of compressive sensing are at the limit.
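
To make the comparison between (2), (5), and (6) concrete, the sketch below evaluates their right-hand sides for a given matrix. It is only an illustration: the constant `c0` is a placeholder (the text does not give $C_0$), `c1` defaults to the asymptotic value $1/128$ mentioned in the footnote, the helper names are hypothetical, and the Gaussian matrix is assumed to have entries of variance $1/n$ so that $\|A\|_F^2$ concentrates around $m$.

```python
import math
import random


def frobenius_sq(a):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(entry * entry for row in a for entry in row)


def lower_bound_5(a, k, sigma, c1=1.0 / 128.0):
    """Right-hand side of (5): C1 * k * sigma^2 / ||A||_F^2 * log(n / k)."""
    n = len(a[0])
    return c1 * k * sigma ** 2 / frobenius_sq(a) * math.log(n / k)


def lower_bound_6(a, k, sigma):
    """Right-hand side of (6): k * sigma^2 / ||A||_F^2."""
    return k * sigma ** 2 / frobenius_sq(a)


def upper_bound_2(m, n, k, sigma, c0=1.0):
    """Right-hand side of (2) with a placeholder constant c0."""
    return c0 * k * sigma ** 2 / m * math.log(n)


def gaussian_matrix(m, n, seed=0):
    """m x n matrix with i.i.d. N(0, 1/n) entries, so E||A||_F^2 = m."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(m)]
```

For such a matrix $\|A\|_F^2 \approx m$, so (5) and (2) differ only in the constant and in $\log(n/k)$ versus $\log n$.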

Although the noise model in (1) is fairly common, in some settings (such as the estimation of 
a signal transmitted over a noisy channel) it is more natural to consider noise that has been added 
directly to the signal prior to the acquisition of the measurements. In this case we can directly 
apply Theorem 1 to obtain the following corollary. 

Corollary 1. Suppose that we observe $y = A(x + w)$ where $x$ is a $k$-sparse vector, $A$ is an $m \times n$ matrix with $k \le m \le n$, and $w \sim \mathcal{N}(0, \sigma^2 I)$. Then for all $A$

$$M^*(A) \ge C_1 \frac{k\sigma^2}{m} \log(n/k) \quad \text{and} \quad M^*(A) \ge \frac{k\sigma^2}{m}. \tag{7}$$

Proof. We assume that $A$ has rank $m' \le m$. Let $U \Sigma V^*$ be the reduced SVD of $A$, where $U$ is $m \times m'$, $\Sigma$ is $m' \times m'$, and $V$ is $n \times m'$. Applying the matrix $\Sigma^{-1} U^*$ to $y$ preserves all the information about $x$, and so we can equivalently assume that the data is given by

$$y' = \Sigma^{-1} U^* y = V^* x + V^* w. \tag{8}$$

Note that $V^* w$ is a Gaussian vector with covariance matrix $\sigma^2 V^* V = \sigma^2 I$. Moreover, $V^*$ has unit-norm rows, so that $\|V^*\|_F^2 \le m' \le m$. We then apply Theorem 1 to establish (7). □

The intuition behind this result is that when noise is added to the measurements, we can boost 
the SNR by rescaling A to have higher norm. When we instead add noise to the signal, the noise 
is also scaled by A, and so no matter how A is designed there will always be a penalty of 1/m. 
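
The scaling intuition can be illustrated with a small sketch. Here `bound_measurement_noise` evaluates the right-hand side of (6) and `bound_signal_noise` the second bound in (7); both names are illustrative. Rescaling $A$ by a factor $c$ shrinks the first bound by $c^2$ and leaves the second unchanged.

```python
def frobenius_sq(a):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(entry * entry for row in a for entry in row)


def scale(a, c):
    """Return the matrix c * A."""
    return [[c * entry for entry in row] for row in a]


def bound_measurement_noise(a, k, sigma):
    """Right-hand side of (6) for y = A x + z: k * sigma^2 / ||A||_F^2."""
    return k * sigma ** 2 / frobenius_sq(a)


def bound_signal_noise(a, k, sigma):
    """Second bound in (7) for y = A (x + w): k * sigma^2 / m.

    It depends on A only through the number of rows m.
    """
    m = len(a)
    return k * sigma ** 2 / m
```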



1.3 Related work 

There have been a number of prior works that have established lower bounds on M*(A) or related 
quantities under varying assumptions [1, 13-17, 19]. In [1, 17], techniques from information theory 
similar to the ones that we use below are used to establish rather general lower bounds under 
the assumption that the entries of x are generated i.i.d. according to some distribution. For an 
appropriate choice of distribution, x will be approximately sparse and [1, 17] will yield asymptotic 
lower bounds of a similar flavor to ours. 

The prior work most closely related to our results is that of Ye and Zhang [19] and Raskutti, Wainwright, and Yu [15]. In [19] Ye and Zhang establish a bound similar to (5) in Theorem 1. While the resulting bounds are substantially the same, the bounds in [19] hold only in the asymptotic regime where $k \to \infty$, $n \to \infty$, and $k/n \to 0$, whereas our bounds hold for arbitrary finite values of $k$ and $n$, including the case where $k$ is relatively large compared to $n$. In [15] Raskutti et al. reach a somewhat similar conclusion to our Theorem 1 via a similar argument, but where it is assumed that $A$ satisfies $\|Ax\|_2^2 \le (1 + \delta) \|x\|_2^2$ for all $x \in \Sigma_{2k}$ (i.e., the upper bound of the restricted isometry property (RIP)). In this case the authors show that²

$$M^*(A) \ge C \frac{k\sigma^2}{(1 + \delta) n} \log(n/k).$$

Our primary aim, however, is to challenge the use of the RIP and/or random matrices and to determine whether we can do better via a different choice of $A$. Our approach relies on standard tools from information theory such as Fano's inequality, and as such is very similar in spirit to the approaches in [1, 15, 17]. The proof of Theorem 1 begins by following a similar path to that taken in [15]. As in the results of [15], we rely on the construction of a packing set of sparse vectors. However, we place no assumptions whatsoever on the matrix $A$. To do this we must instead consider a random construction of this set, allowing us to apply the recently established matrix version of Bernstein's inequality due to Ahlswede and Winter [2] to bound the empirical covariance matrix of the packing set. Our analysis is divided into two parts. In Section 2 we provide the proof of Theorem 1, and in Section 3 we provide the construction of the necessary packing set.

¹ Our analysis shows that asymptotically $C_1$ can be taken as $1/128$. We have made no effort to optimize this constant, and it is probably far from sharp. This is why we give the simpler bound (6), which is proven by considering the error we would incur even if we knew the support of $x$ a priori. However, our main result is (5). We leave the calculation of an improved constant to future work.



1.4 Notation 

We now provide a brief summary of the notation used throughout the paper. If $A$ is an $m \times n$ matrix and $T \subset \{1, \ldots, n\}$, then $A_T$ denotes the $m \times |T|$ submatrix with columns indexed by $T$. Similarly, for a vector $x \in \mathbb{R}^n$ we let $x|_T$ denote the restriction of $x$ to $T$. We will use $\|x\|_p$ to denote the standard $\ell_p$ norm of a vector, and for a matrix $A$, we will use $\|A\|$ and $\|A\|_F$ to denote the operator and Frobenius norms respectively.



2 Proof of Main Result 

In this section we establish the lower bound (5) in Theorem 1. The proof of (6) is provided in the Appendix. In the proofs of both (5) and (6), we will assume that $\sigma = 1$ since the proof for arbitrary $\sigma$ follows by a simple rescaling. To obtain the bound in (5) we begin by following a similar course as in [15]. Specifically, we will suppose that $x$ is distributed uniformly on a finite set of points $\mathcal{X} \subset \Sigma_k$, where $\mathcal{X}$ is constructed so that the elements of $\mathcal{X}$ are well separated. This allows us to apply the following lemma, which follows from Fano's inequality combined with the convexity of the Kullback-Leibler (KL) divergence. We provide a proof of the lemma in the Appendix.

Lemma 1. Consider the measurement model where $y = Ax + z$ with $z \sim \mathcal{N}(0, I)$. Suppose that there exists a set of points $\mathcal{X} = \{x_i\}_{i=1}^{|\mathcal{X}|} \subset \Sigma_k$ such that for any $x_i, x_j \in \mathcal{X}$ with $i \ne j$, $\|x_i - x_j\|_2^2 \ge 8 n M^*(A)$, where $M^*(A)$ is defined as in (4). Then

$$\frac{1}{2} \log |\mathcal{X}| - 1 \le \frac{1}{2 |\mathcal{X}|^2} \sum_{i,j=1}^{|\mathcal{X}|} \|A x_i - A x_j\|_2^2. \tag{9}$$



² Note that it is possible to remove the assumption that $A$ satisfies the upper bound of the RIP, but with a rather unsatisfying result. Specifically, for an arbitrary matrix $A$ with a fixed Frobenius norm, we have that $\|A\|^2 \le \|A\|_F^2$, so that $(1 + \delta) \le \|A\|_F^2$. This bound can be shown to be tight by considering a matrix $A$ with only one nonzero column. However, applying this bound underestimates $M^*(A)$ by a factor of $n$. Of course, the bounds coincide for "good" matrices (such as random matrices) which will have a significantly smaller value of $\delta$ [14]. However, the random matrix framework is precisely that which we wish to challenge.

By taking the set $\mathcal{X}$ in Lemma 2 below and rescaling these points by $4\sqrt{n M^*(A)}$, we have that there exists a set $\mathcal{X}$ satisfying the assumptions of Lemma 1 with

$$|\mathcal{X}| = (n/k)^{k/4},$$

and hence from (9) we obtain

$$\frac{k}{4} \log(n/k) - 2 \le \frac{1}{|\mathcal{X}|^2} \sum_{i,j=1}^{|\mathcal{X}|} \|A x_i - A x_j\|_2^2 = \operatorname{Tr}\left( A^* A \left( \frac{1}{|\mathcal{X}|^2} \sum_{i,j=1}^{|\mathcal{X}|} (x_i - x_j)(x_i - x_j)^* \right) \right). \tag{10}$$

If we set

$$\mu = \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} x_i \quad \text{and} \quad Q = \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} x_i x_i^*,$$

then one can show that

$$\frac{1}{|\mathcal{X}|^2} \sum_{i,j=1}^{|\mathcal{X}|} (x_i - x_j)(x_i - x_j)^* = 2 (Q - \mu \mu^*).$$

Thus, we can bound (10) by

$$2 \operatorname{Tr}\left( A^* A (Q - \mu \mu^*) \right) \le 2 \operatorname{Tr}\left( A^* A Q \right),$$

where the inequality follows since $\operatorname{Tr}(A^* A \mu \mu^*) = \|A\mu\|_2^2 \ge 0$. Moreover, since $A^* A$ and $Q$ are positive semidefinite,

$$\operatorname{Tr}(A^* A Q) \le \operatorname{Tr}(A^* A) \|Q\| = \|A\|_F^2 \|Q\|.$$

Combining this with (10) and applying Lemma 2 to bound the norm of $Q$ (recalling that it has been appropriately rescaled), we obtain

$$\frac{k}{4} \log(n/k) - 2 \le (1 + \beta) 32 M^*(A) \|A\|_F^2,$$

where $\beta$ is a constant that can be arbitrarily close to 0. This yields the desired result.
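
The identity relating the averaged pairwise differences to $2(Q - \mu\mu^*)$ is easy to confirm numerically. The sketch below is only a sanity check on an arbitrary small point set; the helper names are hypothetical and matrices are stored as plain lists of rows.

```python
def outer(u, v):
    """Outer product u v^* for real vectors."""
    return [[ui * vj for vj in v] for ui in u]


def mat_add(a, b, scale=1.0):
    """Return a + scale * b entrywise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def zeros(n):
    """n x n zero matrix."""
    return [[0.0] * n for _ in range(n)]


def pairwise_difference_average(points):
    """(1 / |X|^2) * sum over i, j of (x_i - x_j)(x_i - x_j)^*."""
    n = len(points[0])
    size = len(points)
    acc = zeros(n)
    for xi in points:
        for xj in points:
            d = [a - b for a, b in zip(xi, xj)]
            acc = mat_add(acc, outer(d, d))
    return [[v / size ** 2 for v in row] for row in acc]


def two_q_minus_mu_mu(points):
    """2 * (Q - mu mu^*), with mu the empirical mean and Q the second moment."""
    n = len(points[0])
    size = len(points)
    mu = [sum(x[c] for x in points) / size for c in range(n)]
    q = zeros(n)
    for x in points:
        q = mat_add(q, outer(x, x), 1.0 / size)
    return [[2.0 * v for v in row] for row in mat_add(q, outer(mu, mu), -1.0)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Calling `max_abs_diff(pairwise_difference_average(pts), two_q_minus_mu_mu(pts))` on any list of equal-length vectors `pts` should return a value at the level of floating-point rounding.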



3 Packing Set Construction 

We now return to the problem of constructing the packing set X. As noted above, our construction 
exploits the following matrix Bernstein inequality of Ahlswede and Winter [2]. See also [18]. 

Theorem 2 (Matrix Bernstein Inequality). Let $\{X_i\}$ be a finite sequence of independent zero-mean random self-adjoint matrices of dimension $n \times n$. Suppose that $\|X_i\| \le 1$ almost surely for all $i$ and set $\rho^2 = \left\| \sum_i \mathbb{E}[X_i^2] \right\|$. Then for all $t \in [0, 2\rho^2]$,

$$\mathbb{P}\left[ \left\| \sum_i X_i \right\| \ge t \right] \le 2n \exp\left( -\frac{t^2}{4\rho^2} \right). \tag{11}$$

We construct the set $\mathcal{X}$ by choosing points at random, which allows us to apply Theorem 2 to establish a bound on the empirical covariance matrix. In bounding the size of $\mathcal{X}$ we follow a similar course as in [15] and rely on techniques from [11].



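Lemma 2. Let $n$ and $k$ be given, and suppose for simplicity that $k$ is even and $k \le n/2$. There exists a set $\mathcal{X} = \{x_i\}_{i=1}^{|\mathcal{X}|} \subset \Sigma_k$ of size

$$|\mathcal{X}| = (n/k)^{k/4} \tag{12}$$

such that

(i) $\|x_i - x_j\|_2^2 \ge 1/2$ for all $x_i, x_j \in \mathcal{X}$ with $i \ne j$; and

(ii) $\left\| \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} x_i x_i^* - \frac{1}{n} I \right\| \le \beta / n$,

where $\beta$ can be made arbitrarily close to 0 as $n \to \infty$.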

Proof. We will show that such a set $\mathcal{X}$ exists via the probabilistic method. Specifically, we will show that if we draw $|\mathcal{X}|$ independent $k$-sparse vectors at random, then the set will satisfy both (i) and (ii) with probability strictly greater than 0. We will begin by considering the set

$$U = \left\{ x \in \left\{ 0, +\sqrt{1/k}, -\sqrt{1/k} \right\}^n : \|x\|_0 = k \right\}.$$

Clearly, $|U| = \binom{n}{k} 2^k$. Next, note that for all $x, x' \in U$, $\frac{1}{k} \|x' - x\|_0 \le \|x' - x\|_2^2$, and thus if $\|x' - x\|_2^2 \le 1/2$ then $\|x' - x\|_0 \le k/2$. From this we observe that for any fixed $x \in U$,

$$\left| \left\{ x' \in U : \|x' - x\|_2^2 \le 1/2 \right\} \right| \le \left| \left\{ x' \in U : \|x' - x\|_0 \le k/2 \right\} \right| \le \binom{n}{k/2} 3^{k/2}.$$
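
For very small $n$ and $k$ the counting bound can be verified by brute force. The sketch below enumerates $U$ through its sign patterns (the common scale $\sqrt{1/k}$ does not change $\|x' - x\|_0$) and counts neighbors directly; the function names are illustrative only.

```python
import itertools
import math


def enumerate_u(n, k):
    """All sign patterns in U, stored as tuples in {0, +1, -1}^n.

    The common scale sqrt(1/k) is omitted because it does not affect
    the number of coordinates in which two vectors differ.
    """
    vectors = []
    for support in itertools.combinations(range(n), k):
        for signs in itertools.product((1, -1), repeat=k):
            x = [0] * n
            for index, sign in zip(support, signs):
                x[index] = sign
            vectors.append(tuple(x))
    return vectors


def hamming(x, y):
    """Number of coordinates where x and y differ, i.e. ||x - y||_0."""
    return sum(1 for a, b in zip(x, y) if a != b)


def neighbors_within(u, x, radius):
    """Count x' in U with ||x' - x||_0 <= radius (x itself included)."""
    return sum(1 for other in u if hamming(x, other) <= radius)


def neighbor_bound(n, k):
    """The bound binom(n, k/2) * 3^(k/2) from the text (k even)."""
    return math.comb(n, k // 2) * 3 ** (k // 2)
```

The enumeration has $\binom{n}{k} 2^k$ elements, so this check is only practical for toy sizes such as $n = 6$, $k = 2$.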

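Suppose that we construct $\mathcal{X}$ by picking elements of $U$ uniformly at random. When adding the $j$th point to $\mathcal{X}$, the probability that $x_j$ violates (i) with respect to the previously added points is bounded by

$$\frac{(j - 1) \binom{n}{k/2} 3^{k/2}}{\binom{n}{k} 2^k}.$$

Thus, using the union bound, we can bound the total probability that $\mathcal{X}$ will fail to satisfy (i), denoted $P_1$, by

$$P_1 \le \sum_{j=1}^{|\mathcal{X}|} \frac{(j - 1) \binom{n}{k/2} 3^{k/2}}{\binom{n}{k} 2^k} \le \frac{|\mathcal{X}|^2}{2} \frac{\binom{n}{k/2}}{\binom{n}{k}} \left( \frac{\sqrt{3}}{2} \right)^k.$$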
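
The random construction can be mimicked in a few lines. The following is a minimal sketch, not the argument of the proof itself: it draws $|\mathcal{X}|$ i.i.d. points from $U$ and simply redraws the whole set if property (i) fails, whereas the proof only needs a single draw to succeed with positive probability. The names `sample_from_u` and `draw_packing`, the retry limit, and the rounding of $|\mathcal{X}|$ to an integer are assumptions made for illustration.

```python
import math
import random


def sample_from_u(n, k, rng):
    """Draw x uniformly from U: k random positions, random signs, scale sqrt(1/k)."""
    x = [0.0] * n
    scale = math.sqrt(1.0 / k)
    for index in rng.sample(range(n), k):
        x[index] = scale if rng.random() < 0.5 else -scale
    return x


def dist_sq(x, y):
    """Squared Euclidean distance ||x - y||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_separation(points, threshold=0.5, tol=1e-12):
    """Check property (i): all pairwise squared distances >= threshold.

    The small tolerance guards against floating-point rounding.
    """
    return all(
        dist_sq(points[i], points[j]) >= threshold - tol
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )


def draw_packing(n, k, seed=0, max_attempts=100):
    """Redraw floor((n/k)^(k/4)) i.i.d. points from U until (i) holds."""
    rng = random.Random(seed)
    size = int((n / k) ** (k / 4))
    for _ in range(max_attempts):
        points = [sample_from_u(n, k, rng) for _ in range(size)]
        if satisfies_separation(points):
            return points
    raise RuntimeError("no separated set found within max_attempts")
```

For example, `draw_packing(64, 4)` returns 16 unit-norm, 4-sparse vectors whose pairwise squared distances are all at least $1/2$.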



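Next, observe that

$$\frac{\binom{n}{k}}{\binom{n}{k/2}} = \frac{(k/2)! \, (n - k/2)!}{k! \, (n - k)!} = \prod_{i=1}^{k/2} \frac{n - k + i}{k/2 + i} \ge \left( \frac{n - k + k/2}{k/2 + k/2} \right)^{k/2} = \left( \frac{n}{k} - \frac{1}{2} \right)^{k/2},$$

where the inequality follows since $(n - k + i)/(k/2 + i)$ is decreasing as a function of $i$ provided that $n - k > k/2$. Also,

$$\left( \frac{n}{k} \right)^{k/2} \left( \frac{\sqrt{3}}{2} \right)^k = \left( \frac{3n}{4k} \right)^{k/2} \le \left( \frac{n}{k} - \frac{1}{2} \right)^{k/2}$$

with the proviso $k \le n/2$. Thus, for $|\mathcal{X}|$ of size given in (12),

$$P_1 \le \frac{|\mathcal{X}|^2}{2} \frac{\binom{n}{k/2}}{\binom{n}{k}} \left( \frac{\sqrt{3}}{2} \right)^k = \frac{1}{2} \left( \frac{n}{k} \right)^{k/2} \left( \frac{\sqrt{3}}{2} \right)^k \frac{\binom{n}{k/2}}{\binom{n}{k}} \le \frac{1}{2} \left( \frac{n}{k} - \frac{1}{2} \right)^{k/2} \frac{\binom{n}{k/2}}{\binom{n}{k}} \le \frac{1}{2}. \tag{13}$$
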
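
The two combinatorial inequalities behind (13) can be spot-checked for concrete values of $n$ and $k$. The sketch below uses Python's `math.comb`; the function names are illustrative, and the check is a numerical illustration rather than a proof.

```python
import math


def binomial_ratio(n, k):
    """binom(n, k) / binom(n, k/2) for even k, returned as a float."""
    return math.comb(n, k) / math.comb(n, k // 2)


def ratio_lower_bound(n, k):
    """(n/k - 1/2)^(k/2), the lower bound on the ratio derived above."""
    return (n / k - 0.5) ** (k / 2)


def p1_bound(n, k):
    """Union bound on P_1 with |X| = (n/k)^(k/4):
    (|X|^2 / 2) * binom(n, k/2) / binom(n, k) * (sqrt(3)/2)^k.
    """
    size_sq = (n / k) ** (k / 2)
    return 0.5 * size_sq * (math.sqrt(3.0) / 2.0) ** k / binomial_ratio(n, k)
```

For instance, with $n = 8$ and $k = 4$ the ratio is $70/28 = 2.5$, the lower bound is $1.5^2 = 2.25$, and `p1_bound(8, 4)` evaluates to $0.45$.
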
Next, we consider (ii). We begin by letting

$$X_i = x_i x_i^* - \frac{1}{n} I.$$

Since $x_i$ is drawn uniformly at random from $U$, it is straightforward to show that $\|X_i\| \le 1$ and that $\mathbb{E}[x_i x_i^*] = I/n$, which implies that $\mathbb{E}[X_i] = 0$. Moreover,

$$\mathbb{E}[X_i^2] = \mathbb{E}\left[ (x_i x_i^*)^2 \right] - \frac{1}{n^2} I = \frac{n - 1}{n^2} I.$$

Thus we obtain $\rho^2 = \left\| \sum_{i=1}^{|\mathcal{X}|} \mathbb{E}[X_i^2] \right\| = |\mathcal{X}| (n - 1)/n^2 \le |\mathcal{X}|/n$. Hence, we can apply Theorem 2 to obtain

$$\mathbb{P}\left[ \left\| \sum_{i=1}^{|\mathcal{X}|} X_i \right\| \ge t \right] \le 2n \exp\left( -\frac{t^2 n}{4 |\mathcal{X}|} \right).$$

Setting $t = |\mathcal{X}| \beta / n$, this shows that the probability that $\mathcal{X}$ will fail to satisfy (ii), denoted $P_2$, is bounded by

$$P_2 \le 2n \exp\left( -\frac{\beta^2 |\mathcal{X}|}{4n} \right).$$

For the lemma to hold we require that $P_1 + P_2 < 1$, and since $P_1 \le 1/2$ it is sufficient to show that $P_2 < 1/2$. This will occur provided that

$$\beta^2 > \frac{4n \log(4n)}{|\mathcal{X}|}.$$

Since $|\mathcal{X}| = (n/k)^{k/4}$ by (12), $\beta$ can be made arbitrarily small as $n \to \infty$.

□
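
The moment computations used for (ii) can be confirmed exactly on a toy instance by enumerating $U$ with rational arithmetic. This is only a sketch with hypothetical helper names: it checks that $\mathbb{E}[X_i] = 0$ and $\mathbb{E}[X_i^2] = \frac{n-1}{n^2} I$ under the uniform distribution on $U$.

```python
import itertools
from fractions import Fraction


def sign_patterns(n, k):
    """Sign patterns of all x in U (the scale sqrt(1/k) is applied later)."""
    for support in itertools.combinations(range(n), k):
        for signs in itertools.product((1, -1), repeat=k):
            s = [0] * n
            for index, sign in zip(support, signs):
                s[index] = sign
            yield s


def centered_outer(s, n, k):
    """X = x x^* - I/n for x = s / sqrt(k), with exact rational entries."""
    return [
        [Fraction(s[a] * s[b], k) - (Fraction(1, n) if a == b else 0)
         for b in range(n)]
        for a in range(n)
    ]


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def exact_moments(n, k):
    """Return (E[X], E[X^2]) under the uniform distribution on U."""
    count = 0
    mean = [[Fraction(0)] * n for _ in range(n)]
    second = [[Fraction(0)] * n for _ in range(n)]
    for s in sign_patterns(n, k):
        x_mat = centered_outer(s, n, k)
        x_sq = mat_mul(x_mat, x_mat)
        for i in range(n):
            for j in range(n):
                mean[i][j] += x_mat[i][j]
                second[i][j] += x_sq[i][j]
        count += 1
    mean = [[v / count for v in row] for row in mean]
    second = [[v / count for v in row] for row in second]
    return mean, second
```

With $n = 5$ and $k = 2$, `exact_moments(5, 2)` returns the zero matrix and $\frac{4}{25} I$, matching $(n - 1)/n^2$.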



Appendix 

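Proof of (6) in Theorem 1. We begin by noting that

$$M^*(A) = \inf_{\hat{x}} \sup_{T : |T| \le k} \, \sup_{x : \operatorname{supp}(x) = T} \mathbb{E}\left[ \frac{1}{n} \|\hat{x}(y) - x\|_2^2 \right] \ge \sup_{T : |T| \le k} \, \inf_{\hat{x}} \sup_{x : \operatorname{supp}(x) = T} \mathbb{E}\left[ \frac{1}{n} \|\hat{x}(y) - x\|_2^2 \right].$$

Thus for the moment we restrict our attention to the subproblem of bounding

$$M^*(A_T) = \inf_{\hat{x}} \sup_{x : \operatorname{supp}(x) = T} \mathbb{E}\left[ \frac{1}{n} \|\hat{x}(y) - x\|_2^2 \right] = \inf_{\hat{x}} \sup_{x \in \mathbb{R}^k} \mathbb{E}\left[ \frac{1}{n} \|\hat{x}(A_T x + z) - x\|_2^2 \right], \tag{14}$$

where $\hat{x}(\cdot)$ takes values in $\mathbb{R}^k$. The last equality of (14) follows since if $\operatorname{supp}(x) = T$ then

$$\|\hat{x}(y) - x\|_2^2 = \left\| \hat{x}(y)|_T - x|_T \right\|_2^2 + \left\| \hat{x}(y)|_{T^c} \right\|_2^2,$$

so that the risk can always be decreased by setting $\hat{x}(y)|_{T^c} = 0$. This subproblem (14) has a well-known solution (see Exercise 5.8 on p. 403 of [12]). Specifically, let $\lambda_j(A_T^* A_T)$ denote the eigenvalues of the matrix $A_T^* A_T$. Then

$$M^*(A_T) = \frac{1}{n} \sum_{j=1}^{k} \frac{1}{\lambda_j(A_T^* A_T)}. \tag{15}$$
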
Thus we obtain

$$M^*(A) \ge \sup_{T : |T| \le k} M^*(A_T) = \sup_{T : |T| \le k} \frac{1}{n} \sum_{j=1}^{k} \frac{1}{\lambda_j(A_T^* A_T)}. \tag{16}$$

Note that if there exists a subset $T$ for which $A_T$ is not full rank, then at least one of the eigenvalues $\lambda_j(A_T^* A_T)$ will vanish and the minimax risk will be unbounded. This also shows that the minimax risk is always unbounded when $m < k$.

Thus, we now assume that $A_T$ is full rank for any choice of $T$. Since $f(x) = 1/x$ is a convex function for $x > 0$, we have that

$$\sum_{j=1}^{k} \frac{1}{\lambda_j(A_T^* A_T)} \ge \frac{k^2}{\sum_{j=1}^{k} \lambda_j(A_T^* A_T)} = \frac{k^2}{\|A_T\|_F^2}.$$

Since there always exists a set of $k$ columns $T_0$ such that $\|A_{T_0}\|_F^2 \le (k/n) \|A\|_F^2$, (16) reduces to yield the desired result.

□
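
The existence of $T_0$ follows from an averaging argument: the $k$ columns with the smallest squared norms carry at most a $k/n$ fraction of $\|A\|_F^2$. The sketch below, with illustrative function names and $\sigma = 1$, selects such a $T_0$ and evaluates the resulting lower bound $k^2 / (n \|A_{T_0}\|_F^2)$.

```python
def column_energy(a):
    """Squared Euclidean norm of each column of A (a list of rows)."""
    n = len(a[0])
    return [sum(row[c] ** 2 for row in a) for c in range(n)]


def lightest_columns(a, k):
    """Indices T_0 of the k columns with the smallest squared norms."""
    energy = column_energy(a)
    return sorted(range(len(energy)), key=lambda c: energy[c])[:k]


def submatrix_frobenius_sq(a, cols):
    """||A_T||_F^2 for the column index set T = cols."""
    energy = column_energy(a)
    return sum(energy[c] for c in cols)


def bound_6_via_t0(a, k):
    """(1/n) * k^2 / ||A_{T_0}||_F^2, a lower bound on M*(A_{T_0}) for sigma = 1."""
    n = len(a[0])
    t0 = lightest_columns(a, k)
    return k ** 2 / (n * submatrix_frobenius_sq(a, t0))
```

By construction `bound_6_via_t0(a, k)` is at least $k / \|A\|_F^2$, which is the right-hand side of (6) with $\sigma = 1$.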

Proof of Lemma 1. To begin, note that if $x$ is uniformly distributed on the set of points in $\mathcal{X}$, then there exists an estimator $\hat{x}(y)$ such that

$$\mathbb{E}_{x,z}\left[ \frac{1}{n} \|\hat{x}(y) - x\|_2^2 \right] \le M^*(A), \tag{17}$$

where the expectation is now taken with respect to both the signal and the noise. We next consider the problem of deciding which $x_i \in \mathcal{X}$ generated the observations $y$. Towards this end, set

$$T(\hat{x}(y)) = \arg\min_{x_i \in \mathcal{X}} \|\hat{x}(y) - x_i\|_2.$$

Define $P_e = \mathbb{P}[T(\hat{x}(y)) \ne x]$. From Fano's inequality [7] we have that

$$H(x \mid y) \le 1 + P_e \log |\mathcal{X}|. \tag{18}$$

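We now aim to bound $P_e$. We begin by noting that for any $x_i \in \mathcal{X}$ and any $\hat{x}(y)$, $T(\hat{x}(y)) \ne x_i$ if and only if there exists an $x_j \in \mathcal{X}$ with $j \ne i$ such that

$$\|\hat{x}(y) - x_i\|_2 \ge \|\hat{x}(y) - x_j\|_2 \ge \|x_i - x_j\|_2 - \|\hat{x}(y) - x_i\|_2.$$

This would imply that

$$2 \|\hat{x}(y) - x_i\|_2 \ge \|x_i - x_j\|_2 \ge \sqrt{8 n M^*(A)}.$$

Thus, we can bound $P_e$ using Markov's inequality as follows:

$$P_e \le \mathbb{P}\left[ \|\hat{x}(y) - x_i\|_2^2 \ge 8 n M^*(A)/4 \right] \le \frac{\mathbb{E}_{x,z}\left[ \|\hat{x}(y) - x_i\|_2^2 \right]}{2 n M^*(A)} \le \frac{n M^*(A)}{2 n M^*(A)} = \frac{1}{2}.$$

Combining this with (18) and the fact that $H(x) = \log |\mathcal{X}|$, we obtain

$$I(x, y) = H(x) - H(x \mid y) \ge \frac{1}{2} \log |\mathcal{X}| - 1.$$
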
From the convexity of KL divergence (see [10] for details), we have that

$$I(x, y) \le \frac{1}{|\mathcal{X}|^2} \sum_{i,j=1}^{|\mathcal{X}|} D(\mathcal{P}_i, \mathcal{P}_j),$$

where $D(\mathcal{P}_i, \mathcal{P}_j)$ represents the KL divergence from $\mathcal{P}_i$ to $\mathcal{P}_j$ where $\mathcal{P}_i$ denotes the distribution of $y$ conditioned on $x = x_i$. Since $z \sim \mathcal{N}(0, I)$, $\mathcal{P}_i$ is simply given by $\mathcal{N}(A x_i, I)$. Standard calculations demonstrate that $D(\mathcal{P}_i, \mathcal{P}_j) = \frac{1}{2} \|A x_i - A x_j\|_2^2$, establishing (9). □
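
The "standard calculation" for the KL divergence between two identity-covariance Gaussians can be illustrated with a Monte Carlo check. In this sketch `mean_p` and `mean_q` stand in for $A x_i$ and $A x_j$; the function names, sample size, and seed are arbitrary choices made for illustration.

```python
import random


def kl_gaussian_identity_cov(mean_p, mean_q):
    """Closed form D(N(mean_p, I), N(mean_q, I)) = 0.5 * ||mean_p - mean_q||_2^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(mean_p, mean_q))


def kl_monte_carlo(mean_p, mean_q, trials=20000, seed=0):
    """Estimate E_p[log p(y) - log q(y)] by sampling y ~ N(mean_p, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = [m + rng.gauss(0.0, 1.0) for m in mean_p]
        log_p = -0.5 * sum((yi - m) ** 2 for yi, m in zip(y, mean_p))
        log_q = -0.5 * sum((yi - m) ** 2 for yi, m in zip(y, mean_q))
        # The Gaussian normalizing constants are equal and cancel.
        total += log_p - log_q
    return total / trials
```

The Monte Carlo estimate should agree with the closed form up to sampling error of order $\|A x_i - A x_j\|_2 / \sqrt{\text{trials}}$.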